library(tidyverse)
library(infer)
library(ggridges)
library(broom)
Lab 9 – One-Way ANOVA
Your group’s names here!
June 2, 2023
Today’s Data
These data come from the Gapminder Foundation, an organization interested in increasing the use and understanding of statistics and other information about social, economic and environmental development at local, national and global levels.
Today we will be comparing math achievement scores across continents and years. Math achievement was measured for 42 countries based on their average score for the grade 8 international TIMSS test.
math_scores <- read_csv(here::here("labs",
"data",
"math_scores.csv")
)
# Creating a year_cat variable that is the categorical version of year
math_scores <- mutate(math_scores,
year_cat = as.factor(year)
)
# Removing the missing values from the grade_8_math_score variable
math_scores <- drop_na(data = math_scores,
grade_8_math_score)
Data Visualizations
The first step for a statistical analysis should always be creating visualizations of the data. Similar to what you are expected to do for your project, you will make three density ridge plots:
- visualizing the relationship between math score and year
- visualizing the relationship between math score and continent
- visualizing the relationship between math score and both year and continent
Plan for Week 9
Asynchronous class on Tuesday and Thursday
Typical deadlines for reading (Tuesday) and tutorial (Thursday)
“Checkpoints” for Final Project incorporated throughout the week
- Introduction – Due Wednesday
- Methods – Due Friday
- Findings & Scope of Inference – Due Sunday
Some advice on your Final Project…
What did we do on Tuesday?
We carried out a hypothesis test!
\[H_0: \beta_1 = 0\]
\[H_A: \beta_1 \neq 0\]
What do these hypotheses mean in words?
By creating a permutation distribution!
What is happening in the generate() step?
And visualizing where our observed statistic fell on the distribution
What would you estimate the p-value to be?
And calculated the p-value
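The steps above can be sketched in base R to show what a permutation replicate does: the response is shuffled so that any link to the explanatory variable is broken, and the slope is recomputed. This is a minimal sketch on invented toy data (the vectors `x` and `y` and the helper `permuted_slope()` are hypothetical), not the infer pipeline used in class.

```r
set.seed(2023)

# Toy data (hypothetical): a numeric explanatory variable and a numeric response
x <- c(2.1, 3.4, 4.0, 4.8, 5.5, 6.1, 6.9, 7.3)
y <- c(3.6, 3.9, 3.8, 4.1, 4.3, 4.2, 4.6, 4.5)

# Observed slope from the fitted regression line
obs_slope <- unname(coef(lm(y ~ x))[2])

# One permutation replicate: shuffle y, refit, and record the slope
permuted_slope <- function() {
  y_shuffled <- sample(y)
  unname(coef(lm(y_shuffled ~ x))[2])
}

# Repeat many times to build the permutation (null) distribution
null_slopes <- replicate(1000, permuted_slope())

# Two-sided p-value: share of permuted slopes at least as extreme as the observed one
p_value <- mean(abs(null_slopes) >= abs(obs_slope))
```

Shuffling only the response keeps both variables' individual distributions intact while making the null hypothesis of no relationship true by construction.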
How would this process have changed if we used theory-based methods instead?
Approximating the permutation distribution
A \(t\)-distribution can be a reasonable approximation for the permutation distribution if certain conditions are not violated.
What about the observed statistic?
Response: score (numeric)
Explanatory: bty_avg (numeric)
# A tibble: 1 × 1
stat
<dbl>
1 0.0666
How did R calculate the \(t\)-statistic?
\(SE_{b_1} = \frac{\frac{s_y}{s_x} \cdot \sqrt{1 - r^2}}{\sqrt{n - 2}}\)
[1] 0.01495204
\(t = \frac{b_1}{SE_{b_1}}\)
bty_avg
4.45672
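The two formulas above can be checked by hand against the output of `lm()`. The sketch below uses invented toy vectors (`x` and `y` are hypothetical, not the evals data) and assumes the slope is computed as \(b_1 = r \cdot s_y / s_x\).

```r
# Toy data (hypothetical)
x <- c(2.1, 3.4, 4.0, 4.8, 5.5, 6.1, 6.9, 7.3)
y <- c(3.6, 3.9, 3.8, 4.1, 4.3, 4.2, 4.6, 4.5)

n <- length(x)
r <- cor(x, y)

# Slope and its standard error from the summary statistics
b1    <- r * sd(y) / sd(x)
se_b1 <- (sd(y) / sd(x)) * sqrt(1 - r^2) / sqrt(n - 2)

# t-statistic: the slope measured in standard-error units
t_stat <- b1 / se_b1

# The same quantities as reported by lm()
lm_row <- summary(lm(y ~ x))$coefficients["x", ]
se_matches <- isTRUE(all.equal(unname(lm_row["Std. Error"]), se_b1))
t_matches  <- isTRUE(all.equal(unname(lm_row["t value"]), t_stat))
```

Both `se_matches` and `t_matches` should be `TRUE`, since the formula is an algebraic rearrangement of the standard error that `lm()` reports.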
# A tibble: 2 × 7
term estimate std_error statistic p_value lower_ci upper_ci
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 intercept 3.88 0.076 51.0 0 3.73 4.03
2 bty_avg 0.067 0.016 4.09 0 0.035 0.099
How does R calculate the p-value?
How many degrees of freedom does this \(t\)-distribution have?
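For simple linear regression, the \(t\)-distribution has \(n - 2\) degrees of freedom. A minimal sketch of the two-sided p-value calculation, assuming the statistic of 4.09 from the regression table and the 463 observations these notes report for the evals dataset:

```r
t_stat <- 4.09   # t-statistic for bty_avg, taken from the regression table
n_obs  <- 463    # number of observations in evals (assumed to be the n used in the fit)

# Degrees of freedom: n minus the two estimated coefficients (intercept and slope)
deg_free <- n_obs - 2

# Two-sided p-value: area in both tails beyond |t|
p_value <- 2 * pt(abs(t_stat), df = deg_free, lower.tail = FALSE)

# Rounded to three decimals, as in the regression table
p_rounded <- round(p_value, 3)
```

The exact value is positive but small enough to round to 0 at three decimals, which is consistent with the `p_value` column of the table.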
Did we get similar results between these methods?
Why not always use theoretical methods?
Theory-based methods only hold if the sampling distribution is normally shaped.
The normality of a sampling distribution depends heavily on model conditions.
What are these “conditions”?
For linear regression we are assuming…
Linear relationship between \(x\) and \(y\)
Independent observations
Normality of residuals
Equal variance of residuals
Linear relationship between \(x\) and \(y\)
What should we do?
Variable transformation!
Independence of observations
The evals dataset contains 463 observations on 94 professors. Meaning, professors have multiple observations.
What can we do?
Best – use a random effects model
Reasonable – collapse the multiple scores into a single score
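Collapsing repeated measurements could look like the sketch below, which averages each professor's rows into one. The data frame `evals_toy` and the column name `prof_id` are hypothetical stand-ins, and taking the mean is only one possible choice of summary.

```r
# Hypothetical data: three professors with several course evaluations each
evals_toy <- data.frame(
  prof_id = c(1, 1, 1, 2, 2, 3),
  score   = c(4.0, 4.4, 4.2, 3.5, 3.9, 4.8),
  bty_avg = c(5.0, 5.0, 5.0, 3.0, 3.0, 7.0)
)

# One row per professor: average the repeated scores (bty_avg is constant within professor)
prof_level <- aggregate(cbind(score, bty_avg) ~ prof_id, data = evals_toy, FUN = mean)
```

After collapsing, each row is a different professor, so the independence condition is far more plausible, at the cost of a smaller sample size.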
Normality of residuals
What should we do?
Variable transformation!
Equal variance of residuals
What should we do?
Variable transformation!
Are these conditions required for both methods?
Simulation-based Methods
Question 1 – Fill in the code below to visualize the distribution of grade 8 math scores over time.
Don’t forget to include axis labels!
Note: I’ve included a scale = 1 argument to show you how you can get the density plots not to overlap!
Question 2 – What do you see in the plot you made? How do the centers (means) of the distributions compare? What about the variability (spread) of the distributions?
Question 3 – Write the code to visualize the distribution of grade 8 math scores for the six different continents.
Don’t forget to include axis labels!
Question 4 – What do you see in the plot you made? How do the centers (means) of the distributions compare? What about the variability (spread) of the distributions?
Question 5 – Write the code to visualize the distribution of grade 8 math scores for the six different continents for each of the four years.
Remember, you could either include a facet or a color here!
Also remember you can use alpha to change the transparency of your density ridges!
Question 6 – What do you see in the plot you made? Does it seem that the relationship between year and grade 8 math scores changes based on the continent of the student?
Statistical Model
For our analysis we will be using an analysis of variance (ANOVA) model. An ANOVA is an appropriate statistical model as we have a continuous response variable (grade 8 math score) and categorical explanatory variables (year, continent). Year is not considered to be a continuous numerical variable as we have only four measurements in time (1996, 1999, 2003, 2007).
Model Conditions
An ANOVA has model conditions that are very similar to what we learned for linear regression. In this section we will evaluate the conditions of the model.
For this section, it might be helpful to know how many observations there are for each year and for each continent. I have written code below to provide you with a table of these numbers:
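One way to build such a count table is sketched below in base R. The data frame `toy_scores` is a small hypothetical stand-in with invented values, not the real `math_scores`.

```r
# Hypothetical stand-in for math_scores (values invented for illustration)
toy_scores <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe", "Europe", "Africa"),
  year_cat  = factor(c(1996, 1999, 1996, 1996, 2003, 2003)),
  grade_8_math_score = c(560, 575, 510, 495, 502, 380)
)

# Cross-tabulate: rows are continents, columns are years, cells are observation counts
obs_counts <- with(toy_scores, table(continent, year_cat))
```

Cells with a count of 0 or 1 are worth noting, because they matter later when standard deviations are computed per group.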
Independence
Based on the table we know:
- each year has measurements on about six continents
- each continent has measurements for about four years
Theory-based Methods
- Linearity of Relationship
- Independence of Observations
- Normality of Residuals
- Equal Variance of Residuals
What happens if the conditions are violated?
In general, when the conditions associated with these methods are violated, the permutation and \(t\)-distributions will underestimate the true standard error of the sampling distribution.
Use this information to evaluate the condition of independence of observations.
Question 7 – Is it reasonable to assume that the observations within a continent are independent of each other?
Question 8 – Is it reasonable to assume that the observations within a year are independent of each other?
Question 9 – Is it reasonable to assume that the observations between continents are independent of each other?
Question 10 – Is it reasonable to assume that the observations between years are independent of each other?
Normality
Now we will evaluate the normality of the distributions of grade 8 math scores across years and across continents – the plot you created in #5. Keep in mind, the normality condition is very important when the sample sizes for each group are relatively small.
Question 11 – Is it reasonable to say that the grade 8 math scores across the four years and six continents are normally distributed?
Equal Variance
Now we will evaluate the equal variance of the distributions of grade 8 math scores across years and across continents – the plot you created in #5. Keep in mind, the constant variance condition is especially important when the sample sizes differ between groups.
For this section, it might be helpful to know the standard deviations for each year / continent combo. I have written code below to provide you with a table of these numbers:
Keep in mind a standard deviation of NA can happen for two reasons, (1) there is no data, or (2) there is only one observation.
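Both sources of `NA` can be seen in a small sketch. The data frame `toy_scores` is hypothetical, with values invented so that one cell has two observations, one has a single observation, and one has none.

```r
# Hypothetical stand-in for math_scores (values invented for illustration)
toy_scores <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe", "Europe", "Africa"),
  year_cat  = factor(c(1996, 1999, 1996, 1996, 2003, 2003)),
  grade_8_math_score = c(560, 575, 510, 495, 502, 380)
)

# Standard deviation for every continent / year combination
sd_table <- with(
  toy_scores,
  tapply(grade_8_math_score, list(continent, year_cat), sd)
)

sd_two_obs <- sd_table["Europe", "1996"]  # two observations: a usable SD
sd_one_obs <- sd_table["Asia", "1996"]    # one observation: sd() returns NA
sd_no_obs  <- sd_table["Africa", "1996"]  # no observations: the cell is NA
```

Comparing this table with a table of counts helps tell the two cases apart, since an `NA` alone does not say whether a group is empty or has one observation.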
Looking at the table, we can see that the largest variance of 10257 (North America, 2007) is nearly 27 times larger than the smallest variance of 381 (Europe, 2003). That’s a lot! So, our equal variance condition is definitely violated.
But, we have learned tools to attempt to remedy this issue! Let’s take the log of grade_8_math_score and see how the variances compare.
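To see why a log transformation can help, consider the sketch below. The two groups are invented so that the spread grows in exact proportion to the level, which is the situation where a log works best; real data will rarely be this tidy.

```r
# Hypothetical groups: group_b is group_a multiplied by 4
group_a <- c(100, 110, 120)
group_b <- c(400, 440, 480)

# Ratio of largest to smallest variance on the original scale
ratio_raw <- var(group_b) / var(group_a)

# Same ratio after taking logs: multiplying by 4 becomes adding log(4),
# which shifts the values without changing their spread
ratio_log <- var(log(group_b)) / var(log(group_a))
```

Here `ratio_raw` is 16 while `ratio_log` is essentially 1. With the real scores, the comparison to make is whether the ratio of largest to smallest variance shrinks after the transformation.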
Question 12 – Based on the variances in the table above, is it reasonable to say that the log grade 8 math scores across the four years and six continents have equal variability?
One-Way ANOVA Inference
We are going to test out both methods for conducting a hypothesis test for an ANOVA – theory-based and simulation-based methods. Keep in mind both methods require independence of observations and equal variability. Normality, however, is only a condition of theory-based methods.
Testing for a Difference Between Years
Since the distribution of grade 8 math scores across the four years wasn’t far from Normal, let’s give a theory-based method a try.
Question 13 – Fill in the code below to conduct a one-way ANOVA modeling the relationship between mean grade 8 math score and the year
Keep in mind the response variable comes first and the explanatory variable comes second!
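The response-then-explanatory order refers to R's formula syntax, `response ~ explanatory`. A minimal sketch with base R's `aov()` on a hypothetical data frame (`toy_scores`, with invented scores) shows the shape of such a fit; it is an illustration rather than the lab's own code chunk.

```r
# Hypothetical data: three invented scores for each of four years
toy_scores <- data.frame(
  year_cat = factor(rep(c(1996, 1999, 2003, 2007), each = 3)),
  grade_8_math_score = c(480, 510, 495, 500, 520, 470,
                         490, 530, 505, 515, 485, 500)
)

# Response on the left of ~, explanatory variable on the right
anova_fit <- aov(grade_8_math_score ~ year_cat, data = toy_scores)

# ANOVA table: first row is the explanatory variable, second row is the residuals
anova_table <- summary(anova_fit)[[1]]
f_stat  <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]
```

With four groups and twelve observations, the table has 3 degrees of freedom for the year and 8 for the residuals. The broom package loaded at the top of the lab provides `tidy()`, which can turn a fitted model like this into a data frame.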
Question 14 – At an \(\alpha = 0.1\), what decision would you reach for your hypothesis test?
Question 15 – What would you conclude about the relationship between the mean grade 8 math scores and year?
Testing for a Difference Between Continents
Since the distribution of grade 8 math scores across the six continents didn’t look very Normal, let’s give a simulation-based method a try.
I’ve gotten you started by calculating the observed F-statistic for the relationship between a country’s grade 8 math score and its continent.
Question 16 – Write the code to generate a permutation distribution of resampled F-statistics.
Question 17 – Visualize the null distribution and shade how the p-value should be calculated
Keep in mind you only look at the right tail for an ANOVA!
Question 18 – Calculate the p-value for the observed F-statistic
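The logic behind a permutation test for an F-statistic can be sketched in base R. This is a sketch of the idea on invented data (the data frame `toy_scores` and the helper `f_stat()` are hypothetical), not the infer code the questions ask for.

```r
set.seed(9)

# Hypothetical data: four invented scores for each of three continents
toy_scores <- data.frame(
  continent = factor(rep(c("Africa", "Asia", "Europe"), each = 4)),
  grade_8_math_score = c(380, 410, 395, 420, 560, 590, 575, 600,
                         500, 515, 490, 520)
)

# Helper: F-statistic for a one-way ANOVA of score on group
f_stat <- function(score, group) {
  anova(lm(score ~ group))[["F value"]][1]
}

# Observed F-statistic
obs_f <- f_stat(toy_scores$grade_8_math_score, toy_scores$continent)

# Null distribution: shuffle the scores across continents and recompute F
null_f <- replicate(
  1000,
  f_stat(sample(toy_scores$grade_8_math_score), toy_scores$continent)
)

# Right-tail p-value: share of permuted F-statistics at least as large as observed
p_value <- mean(null_f >= obs_f)
```

Only the right tail is used because an F-statistic cannot be negative, and larger values mean the group means differ more relative to the variation within groups. In infer, the shuffling corresponds to `generate(type = "permute")` and the right tail to `direction = "greater"`.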
Question 19 – At an \(\alpha = 0.1\), what decision would you reach for your hypothesis test?
Question 20 – What would you conclude about the relationship between the mean grade 8 math scores and continent?